Examining the Influence of Digital Phantom Models in Virtual Imaging Trials for Tomographic Breast Imaging

Purpose: Digital phantoms are one of the key components of virtual imaging trials (VITs) that aims to assess and optimize new medical imaging systems and algorithms. However, these phantoms vary in their voxel resolution, appearance and structural details. This study aims to examine whether and how variations between digital phantoms influence system optimization with digital breast tomosynthesis (DBT) as a chosen modality. Methods: We selected widely used and open access digital breast phantoms generated with different methods. For each phantom type, we created an ensemble of DBT images to test acquisition strategies. Human observer localization ROC (LROC) was used to assess observer performance studies for each case. Noise power spectrum (NPS) was estimated to compare the phantom structural components. Further, we computed several gaze metrics to quantify the gaze pattern when viewing images generated from different phantom types. Results: Our LROC results show that the arc samplings for peak performance were approximately 2.5° and 6° in Bakic and XCAT breast phantoms respectively for 3-mm lesion detection task and indicate that system optimization outcomes from VITs can vary with phantom types and structural frequency components. Additionally, a significant correlation (p¡0.01) between gaze metrics and diagnostic performance suggests that gaze analysis can be used to understand and evaluate task difficulty in VITs. Conclusion: Our results point to the critical need to evaluate realism in digital phantoms as well as ensuring sufficient structural variations at spatial frequencies relevant to the signal size for an intended task. In addition, standardizing phantom generation and validation tools might aid in lower discrepancies among independently conducted VITs for system or algorithmic optimizations.


I. INTRODUCTION
Rapid advances in medical imaging technologies and methods make it impossible to evaluate and optimize several emerging and competing systems and algorithms using clinical imaging trials.These trials can take extensive resources and long duration.Virtual imaging trials (VITs) are alternative approach to assess the potential of an imaging system, software or specific system/software combinations.[1][2][3][4][5][6][7][8] VITs are based on in-silico methods, where digital phantoms replace patients, software platform mimics imaging process, and virtual interpreters represent the human readers.
As VITs are becoming more accurate, realistic digital phantom development has drawn much attention from the medical imaging community.Our group recently examined the contribution of anatomical and quantum noise in signal detection and performance for human observers in such imaging trials [9,10].Using VITs, we have also examined in the past, the development of novel visual search observer models to match humans [11][12][13], system and algorithm optimization questions [3,4,[14][15][16][17], understanding image texture features as relevant to human observer performance [18,19], and radiomics variability [20].As in other group's work, all of these studies were conducted with one breast phantom type.
Here we will evaluate what implications a VIT study outcome may have by changing the phantom type.We once again take DBT as the modality of choice to examine this critical question and use two widely accepted types of breast phantoms -both considered anthropomorphic.For the last few years, many computational breast phantoms have been developed using different methods.[21] Some commonly used breast phantoms include power-law based phantoms [22][23][24][25], anthropomorphic phantoms [26][27][28][29][30][31], modified patient tomographic data [32][33][34][35][36], and mastectomy specimens [37][38][39].These phantoms vary in their resolution, model, structures, and details.Power spectrum analysis has been used as a method to assess phantom structures and realism.[27-30, 38, 40, 41] Cockmartin et al [41] observed in 2013 that none of the evaluated phantoms matched with the patient data in terms of power-spectrum parameters.Few studies extended the realism assessment with trained human's (physicists and radiologists) rating.[27,30,32,34] Badano [42], however, argued that realism is subjective and simulating relevant properties for the task is sufficient.But it is unclear which properties are relevant and what level of realism is sufficient for the task of tomographic imaging system/algorithmic optimizations.
Past work by various groups have used different phantom types for evaluating DBT system configurations [22,37,43,44].Because of the differences in phantoms, configurations, and interpreters among the studies, there is little agreement on the optimal configuration.Specifically, the contribution of phantom differences in these inconsistencies is not well understood.In preliminary analysis, Park et al. [45] and Zhao et al [46] observed that phantom differences (uniform vs structured) influence tomographic system optimal configurations.However, differences resulting from various available structured phantoms (when they are each considered realistic by the research community) has never been explored.
We explore this critical question and examine whether and how phantom structure variability would influence the study outcome of a VIT such as for DBT acquisition parameters.
In addition, we also investigated the influence of task difficulty due to change in the phantom structures using analysis of observer gaze pattern and interpretation process when viewing images generated with different phantoms in the VIT studies.We present results from our study examining the influence of breast phantom types in VIT for DBT when all other aspects remain the same.To the best of our knowledge, ours is the first study to evaluate this critical aspect.
To accomplish this, we selected two widely used and open access breast phantom types which were generated using different procedures.Further, human observers analyzed simulated in-plane DBT images of the selected phantoms for similar configurations.Our results show a comparison of predicted system optimizations between VITs using these two phantom types.

II. MATERIALS AND METHODS
The study methods are summarized below.First, the selected phantoms are listed.Second, our simulation methodology is described to generate both abnormal and normal cases of DBT images.Next, the experimental method to estimate human observer performance and gaze pattern is discussed.Finally, the procedure to characterize phantom structures using a power spectrum is described.

A. Phantom Selection
We selected two types of digital breast phantoms for this study: three-dimensional anthropomorphic breast models generated by Bakic et al. at the university of Pennsylvania [26] (hereafter referred to as Bakic phantom) and XCAT breast phantoms generated at Duke university using compressed volumes of patient breast CT data [33] (hereafter referred to as XCAT breast phantoms).Bakic phantoms are based on mathematical models which define the breast structural variability.These phantoms can be manipulated easily through the model configurations to simulate changes in anatomy.XCAT breast phantoms are based on patient breast CT images.These phantoms may appear more realistic, but changing their anatomical variations or resolution is difficult [7].For our analysis, six digital phantoms of 5-cm thickness for each type were selected.The phantoms of each type were categorized based on volumetric glandular fractions (VGFs) with 25% density and other three with 50% density.Bakic and XCAT breast phantoms have a voxel resolution of (0.2mm) 3 and (0.25mm) 3 respectively.The selected case numbers of XCAT phantoms of approximately 25% density are CTA1608, CTA0357, and CTA1326 and of approximately 50% density are CTB6013, CTA1284, and CTA1285.
Figure 1 shows sample slices of both types of phantoms on the top.The transitions in XCAT breast phantoms from 100% glandular to adipose tissue have intermediately dense (25%,50%, and 75%) voxels while these transitions are sharp (100% dense to adipose) in Bakic phantoms.The XCAT breast phantoms lack small structures such as cooper's ligaments.Erickson et al. [33] noted that the "lack of very fine-detail structures like cooper's ligaments can negatively impact the realism of digital phantoms".References [34,35,47] tried different methods to improve XCAT breast phantoms, but these phantoms were not released publicly.

B. Image Generation
The DBT images used in this study were generated using a simulation platform based on serial cascade model [48,49].The simulation platform modeled generic DBT systems with both source and detector rotating geometry, which was detailed in our previous work [9,49,50] and is described briefly here.The x-ray spectrum modeled a 30-kVp molybdenum anode source with 0.7-mm thick Al filter and the x-ray fluence scaled to provide a 1.5 mGy mean glandular dose (MGD) to a 5-cm-thick compressed breast.This total dose was evenly distributed at each projection angle and the x-ray fluence per projection was determined based on the breast dosimetry data (Dgn coefficients) generated by a Monte Carlo simulator [51].Focal-spot blurring with a 0.1-mm focal-spot size was modeled using a Gaussian modulation transfer function.The detector was modeled as a 0.1-mm thick CsI based a:Si flat panel detector with a 0.1-mm pixel size.The scintillator blurring was modeled using an empirically measured pre-sampling MTF.Quantum noise was modeled by a Poisson distribution for each keV (at the absorption of x-rays with in the scintillator), while additive electronic noise followed a Gaussian process with a standard deviation of 2200 electrons.Scatter was not modeled in this simulation.
The lesion targets were homogeneous spherical masses with a 3-mm diameter.While this is smaller than the average lesion sizes detected by the current DBT systems, VITs are also aimed at evaluating future imaging system designs.This signal size of 3-mm was also chosen to achieve sufficient task-challenge without requiring to artificially alter attenuation values of the signal and to the capture the human observer performance trends more accurately for a range of breast densities.An abnormal case was generated by substituting the lesion into the randomly selected location in the glandular region prior to the projection imaging.The lesion contrast or local glandularity were not matched between phantom types while selecting lesion locations as absolute performance was not relevant for this study.The simulated mass was assumed to have an energy-dependent attenuation coefficient for invasive ductal carcinoma as reported by Johns and Yaffe.[52] Eight abnormal cases and one normal case were formed for each phantom.The projections were acquired using Siddon's ray tracing method [53] to model x-ray transmission through the breast.Two different sets of projections were generated with different phantom types.In order to filter the random noise, an adaptive Wiener filter based denoising algorithm was applied on each projection.[54,55] The denoising algorithm was shown to reduce the noise effectively in our prior work [9].Each data set was acquired over an angular span of 60 • with projection number P ∈{3, 7,11,15,19,21,25,31,35,41, 45} by keeping the total dose steady at 1.5 mGy for each DBT acquisition.Feldkamp filtered back-projection (FBP) algorithm [56] was used for image reconstruction.A threedimensional Butterworth filter with a cutoff of frequency 0.25 cycles/pixel was applied on reconstructed volumes.In-plane DBT images of 1-mm thickness were produced by applying a boxcar averaging.Eight lesion-present (abnormal) and eight lesion-absent (normal) images were created for each phantom for the human observer studies as described in our earlier work [4].A set of 96 images were produced from the six phantoms of each phantom type for a given projection number.Figure 1 shows sample 1-mm abnormal DBT slices of both types of phantoms on the bottom.

C. Human Observer Study
Three non-radiologists took part in the Localization ROC (LROC) experiments.An LROC study entails both detection and localization of a 3-mm spherical lesion.The task in our study is search and localization/detection of spherical mass in simulated in-plane DBT images thus justifying the non-radiologists as observers.Physicists and engineers who participated as observers had the same level of experience in reading simulated images.
The human observer experimental method was same as that described in our prior work [3,4,11,12] and summarized briefly here.In this study, in-plane DBT slices of the Bakic image sets.Observers were asked to select the lesion location, and a four-point ordinal scale was used to collect the confidence rating.Localizations were considered as correct when the observer selected the location within a 2-mm radius of lesion center (radius of spherical lesion + 0.5mm additional radius for human selection error).The observers performance was quantified with area under LROC curves (AUC).The estimate of AUC for a given observer and protocol was obtained with a Wilcoxon-based non-parametric ranking method.In order to assess the consistency of the AUC values between the observers, the intraclass correlation coefficient (ICC) was calculated using Python (v3.7.6) software (www.python.org).

D. Eye Gaze Collection
Numerous studies have demonstrated that gaze metrics correlate with diagnostic performance and can reveal human observers interpretation process [57][58][59][60].In order to evaluate the differences in the interpretation process due to the change of phantom type, we collected gaze data using a screen based eye-tracking system (Tobii pro X3-120 with EPU ).An additional three more non-radiologists participated for the eye tracking study.The eye gaze data was collected for a single acquisition protocol of 35 projections for both type of phantoms.Both presentation of stimuli and eye-tracking were controlled by a Lenovo thinkpad p52s laptop using an in-house built Python software.A calibration and validation procedures were performed before starting a session.The images were displayed on a standard Dell 23.8 inch LCD monitor with a resolution of 1920 × 1080.Once the software collects and stores the gaze logs, post processing was applied to estimate fixation locations, fixation durations, and saccade durations using an I-VT filter based on Tobii pro white paper [61].We computed five gaze metrics to characterize the gaze pattern.Namely, we found total time spent on each image, total number of fixations made on each image, time taken to first fixate on the lesion region (first hit time), number of fixations on lesion region, and accumulated lesion dwell time.These gaze metrics were estimated using MATLAB2018a (The MathWorks, Inc.) with in-house built scripts.Spearman rank correlation was estimated between the average values of each gaze metric and the corresponding AUC values of the six observers.

E. Noise Power Spectrum (NPS)
Noise power spectrum (NPS) based analysis has been used in literature to quantify the similarity between phantom structures and clinical data.[28,41].Our NPS estimation methods are briefly summarized here and can be found in detail elsewhere [9].We selected multiple regions of interest (ROIs) of size 2.43cm×2.43cmfrom each lesion-absent DBT slice from the breast region.The mean of each ROI was subtracted from the corresponding ROI and a Hann tapering window was applied to each ROI to reduce the edge artifacts.The 2D NPS was calculated by ensemble averaging the square of the magnitude of the discrete Fourier transform of each tapered ROI , and radial averaging of the 2D NPS was performed, resulting in a 1D NPS.A linear regression fit (from lower frequency ranges between 0.1 and 0.15 mm −1 to higher frequency ranges between 0.4 and 0.7 mm −1 ) was estimated to the natural logarithm of the 1D NPS that maximizes the coefficient of determination (R 2 ).The NPS parameter β is estimated as the slope of linear fit.

A. Human Observer LROC Results
In our studies, each human observer was tasked with search and localization of a 3-mm lesion within the displayed DBT image slice (similar to the ones shown in fig 1). Figure 3 shows sample regions of DBT slices with lesion at the center of region to illustrate the changes in the visibility of the signal for varying number of projections in both types of phantom backgrounds.The signal is better visible in Bakic phantom background for the acquisition configuration of 35 projections, whereas the signal is better visible in XCAT breast background for the acquisitions of 11 to 35 projections.Figure 4 presents LROC plots for the three observers, for a sample acquisition of 35 projections over 60 • arc span, in both phantom backgrounds.The y-axis of the LROC represents the joint probability of correctly localizing a lesion in a case reported as positive.Therefore, the curve reaches up to the percentage of cases with correct lesion localization.The LROC AUC values above zero are considered as better than guessing as the likelihood of guessing the lesion's location is zero.Figure 5 shows the average performance of the three observers in both Bakic and XCAT breast phantom images.Error bar lengths indicate twice the standard error of the three observers' AUC values.We observed a greater improvement in performance up to 11 projections in XCAT breast phantom backgrounds and a steady performance thereafter.This corresponds to an arc sampling of approximately 6 • between adjacent projections for peak performance.In Bakic phantom backgrounds, observers' performance improved up to 25 projections and required finer arc sampling of approximately 2.5 • to achieve peak performance.We also plotted the detection performance separately for 25% dense and 50% dense slices, where 25% dense indicates easy level of task difficulty while 50% dense indicates higher level of task difficulty.Regardless of phantom type, both levels of task difficulty show similar trends and suggest that optimization may not change with the task difficulty, which is in accordance with earlier observations by Zeng et al. [44] and Mackenzie et al. [62].We also noticed in figures 4 and 5 that observers had overall slightly lower performance in XCAT breast backgrounds than in Bakic backgrounds in particular for 25% dense images.This difference can not be attributed to phantom structures necessarily as the local densities in the region where lesions are inserted could also influence these LROC AUC.The magnitude of AUC values are less relevant in this particular study as only relative change in AUC values with changing system parameters (such a number of projection in this study) were used to deduce the final conclusions.The inter-observer agreement, quantified using the ICC, ranged from 0.92 to 0.95 with the average AUC values, suggesting a strong agreement between the observers.

B. Power Spectrum Analysis
In order to characterize the structural variations of the selected phantom types, we estimated the NPS of the simulated DBT images.We selected the NPS of in-plane DBT images corresponding to the acquisition configurations of 35 projections over a 60 • arc span for comparing the phantoms' structures.Figure 6 shows the averaged NPS of both the phantom types in log-log scale along with linear fits.These plots show that Bakic phantom backgrounds have higher spectral densities at higher frequencies (considered as the anatomical region) than that of XCAT breast phantom backgrounds.We note that this high anatomic noise is not reflected in the calculated β values.The estimated values of β were 2.52 and 3.32 respectively for Bakic and XCAT breast phantom backgrounds.This result is contradictory to popular belief that lower β values indicate lower anatomical noise and in agreement with observations made in our prior work [9].

C. Gaze Analysis
From the eye-tracking data, the total time spent as well as the total number of fixations made on images (both lesion present and lesion absent images) were estimated.For lesion present images, we also estimated the first hit time (on the lesion), lesion dwell time and 13 the number of fixations on lesion.Each of these gaze metric was averaged across all images for a given phantom type and breast density for each observer.Figure 7 shows the influence of phantom type on two of the gaze metrics plotted separately for lesion absent and lesion present images.Observers spent longer time and made more fixations to diagnose lesion absent images than lesion present images, which is in-line with previous findings [63,64].
The most striking result to emerge from this data is that observers spent longer time and made more fixations on images with the Bakic phantom backgrounds than XCAT breast backgrounds.This difference is significant (p − value < 0.01) in lesion absent images only.
A possible explanation for this difference may be that the greater anatomical noise of Bakic phantom backgrounds than the XCAT breast backgrounds (see Fig. 6) made observers less confident and hence spent longer time to make decision.Spearman rank correlation coefficient was computed between the average values of each gaze metric and the AUC values of both phantom types and densities.Table I shows that all of the estimated gaze metrics have good correlation with diagnostic performance.Our observation of first hit time, lesion dwell time, and number of fixations on lesion showed that the observers took longer to first fixate on the lesion, spent less cumulative time on lesion, and had fewer fixations on the lesion as the task difficulty increased due to higher density.No significant difference was observed in these gaze metrics due to change of phantom type in lesion present images.This result indicates that lesion visibility or task difficulty is a major factor in defining the gaze pattern in lesion present images where as anatomical complexity is the major factor influencing the gaze pattern in lesion absent images.First hit time showed a strong negative (-0.85) correlation with the AUC values as shown in figure 8.This result suggest that the quicker an observer fixate on the lesion the better the diagnostic performance which is in accordance with the observation made by Kundel et al [59].The positive correlation of lesion fixations and dwell time with AUC suggests that observers fixate the lesion longer and multiple times to locate accurately.All the five gaze metrics showed significant correlation (p − value < 0.01) with AUC values.This work evaluated the influence of variations between digital breast phantoms on DBT optimization and interpretation process for a small lesion localization task.Our results indicate that the phantoms should have adequate structures at spatial frequencies that are relevant for the signal size and the intended task for sufficient realism.We observed that optimal number of projections for peak detection performance could change with the structural complexity of the phantoms.In addition, we observed that the number of projections required to achieve maximum performance is smaller for XCAT breast phantoms than for Bakic phantoms.Our power spectrum analysis that the complex structures in Bakic phantoms contribute to high frequencies while the high-frequency content of XCAT breast phantoms showed lower amplitudes.These high-frequency structures resulted in more aliasing artifacts in Bakic phantom backgrounds compared to XCAT breast phantom backgrounds under similar sparse sampling conditions.Hence, more projections were required to resolve these high-frequency structures in Bakic phantom backgrounds.
Our gaze analysis suggest that all the five gaze metrics showed good correlation with diagnostic performance, which is in accordance with the observations made by Voisin at al [60] in mammograpy.This result indicates that gaze metrics and interpretation process were influenced by the task difficulty.Hence, these gaze metrics can be used to understand the task difficulty in VIT imaging.
In this study, we evaluated the optimal number of projections in an arc span of 60 • , which is wider than the arc ranges used in the clinical DBT systems.The arc samplings between the adjacent projections for peak performance were approximately 2.5 • and 6 • in Bakic and XCAT breast phantoms respectively for 3-mm lesion detection task.For other arc spans, we expect similar amount of aliasing artifacts for the similar arc sampling.Thus our conclusions should hold for other arc spans used in clinical DBT prototypes which have not been examined here in our VIT study.
The difference in the optimal arc sampling from VIT conducted with two widely used phantom types, raises the need to standardize and unify the frequency contents and complexity of digital phantoms if results published from multiple groups need to be compared.
The goal of our study was not to determine if one of the two phantoms are favorable or better than the other for use in VIT for breast.It is likely that the perceived realism and correlation with observer performance for VIT with actual clinical imaging trials would show results in favor of one of these phantoms based on the chosen task and the signal type.
One of the limitations of our study is that only 3 observers participated in estimating AUC trends.This number is smaller than the number of observers participated in DBT optimization studies in literature [44,[65][66][67].However, the observers had higher agreement with each other which is backed by our estimation of intraclass correlation coefficient value of greater than 0.92.Secondly, we also estimated correlation between the average AUC of Bakic and XCAT breast phantoms.The spearman rank correlation of 0.2 suggests that the two trends are not similar.
Furthermore, the study was conducted only for the task of searching and locating a 3-mm spherical lesion as conducting human observer studies for different signal sizes and types is time consuming and expensive.Prior studies suggest that smaller (high frequency) lesions dictate the optimal number of projections because their visibility is affected more by both aliasing artifacts (at lower number of projections) and random noise (mainly at higher number of projections) [23,43].Many studies also used 3-mm [22,23,37] and similar sizes [44,[66][67][68] of lesions as a signal target in optimization studies.We believe that for larger signals the differences in optimal arc sampling due to differences in the selected phantom structures may be less evident than that of smaller signals.We chose a sphere lesion as target because detecting a sphere lesion is relatively easy task compared to complex spicule lesion study, hence requiring minimal training.In addition, the performance of non-radiologists and radiologists was shown to be similar for relatively easy tasks [69] and multiple studies have used spherical lesions in virtual imaging trials [22,23,37,43,44].On the other hand, for high contrast smaller signals such as micro-calcification, random noise was shown to be the dominating factor in determining the optimal configuration rather than the background structures.[43] Future studies will include signals of different sizes and types.
Another limitation of our study is that only six phantoms (to generate a much larger number of independent cases in each study set) were selected for each phantom type, this sample may not represent the entire population of each phantom type.Although, our goal was not to compare between Bakic and XCAT phantoms, but to validate how optimization estimation changes with phantoms that were generated differently.One may chose 6 phantoms in their studies in evaluating optimal configurations as some of the studies in literature chose around 2 to 9 phantoms [11,15,[65][66][67].If the selected new phantoms have similar NPS properties, we anticipate the similar differences in the optimal number of projections.Instead, if the selected phantoms have different NPS properties and results in different optimal number of projections, which strengthens our argument that phantom structures influence the estimation of system optimal configurations.The selected phantoms (25% and 50% dense) are in the higher range of breast densities observed in the clinical data.The average AUC trends (see figure 5) did change due to change of breast density, suggesting that AUC trends may remain same for other density phantoms.These results also point to the need to include more phantoms with variations in VIT studies.
In conclusion, our results indicate that both the structural complexity and the relevant spatial frequency magnitudes in digital phantom structures can influence the estimation of optimal system configurations.Our results highlight the importance of accurate modeling of phantoms to resemble the patient anatomy and the importance of assessment for their realism (for chosen tasks) before use in VITs.This results can be generalized to digital phantoms used for multiple imaging modalities.As a final note, our goal in this particular study was not to discuss superiority of one phantom type over the other based on results shown here.The key aspect is for the VIT and medical physics community to be mindful that optimization or other results shown using one "realistic" breast phantom may not always agree with results when using another "realistic" phantom unless additional standardization efforts are pursued.

FIG. 1 .
FIG. 1.A sample slice of 25% VGF Bakic (top left) and XCAT breast (top right) phantoms and the corresponding 1-mm DBT slices with 3-mm spherical lesion on the bottom.
phantom set correspond to five acquisition protocols with projections of 7, 11, 19, 25 and 35 and XCAT breast phantom set correspond to six acquisition protocols with projections of 3, 7, 11, 19, 25 and 35 were evaluated.The 96 images per set (48 pairs of abnormal/normal images) were divided into 72 test images (36 pairs) and 24 training images (12 pairs).All sets included an initial training session followed by a test session.Each observer thus read 10

Figure 2
Figure 2 shows a sample gaze pattern of an observer for the task of searching and locating a 3-mm lesion.Each vertex indicates the fixation center with numbers indicating fixation order and lines indicating the saccade paths and lengths.

FIG. 3 .
FIG. 3. A sample lesion present regions of DBT slices of both phantoms acquired with different number of projections.

FIG. 4 .
FIG. 4. Human observer performance plotted as localization ROC (LROC) curves for a sample acquisition of 35 projections over 60 • in 3-mm mass detection study in Bakic phantom (top) and XCAT breast phantom (bottom) backgrounds.

FIG. 5 .
FIG. 5. Human observer performance plotted area under LROC (AUC) against number of projections in 3-mm mass detection study in Bakic phantom (top) and XCAT breast phantom (bottom) backgrounds.The results suggest the optimal configurations does not change with the task difficulty but changes with the background structure type.

6 .
FIG. 7. The average amount of time spent and the average number of fixations made on images (includes lesion absent and lesion present images) plotted for both phantom types.Observers spent longer time and made more fixations to make decisions on images with Bakic phantom backgrounds in comparison to those with XCAT breast backgrounds.

FIG. 8 .
FIG. 8. Average value of first time, lesion dwell time and number of fixations on lesion wereplotted against AUC value of each observer (total 6) for both phantom types and two densities.
This work was partially supported by funding from the NIH National Institute of Biomedical Imaging and Bioengineering (NIBIB) grant R01 EB EB029761, the US Department of Defense (DOD) Congressionally Directed Medical Research Program (CDMRP) Breakthrough Award BC151607 and the National Science Foundation CAREER Award 1652892.

TABLE I .
Spearman rank correlation between the AUC values of the six observers and the corresponding average values of the five gaze metrics.All five metrics show good and significant correlation with the diagnostic performance.